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Kohonen self-organisation maps are a well know classification tool, commonly used in a wide variety of problems, 
but with limited applications in time series forecasting context. In this paper, we propose a forecasting method 
specifically designed for multi-dimensional long-term trends prediction, with a double application of the Kohonen 
algorithm. Practical applications of the method are also presented. 



1. Introduction 

Time series forecasting is a problem encoun- 
tered in many fields of applications, as finance 
(returns, stock markets), hydrology (river floods), 
engineering (electrical consumption), etc. Many 
methods designed for time series forecasting per- 
form well (depending on the complexity of the 
problem) on a rather short-term horizon but are 
rather poor on a longer-term one. This is due to 
the fact that these methods are usually designed 
to optimize the performance at short term, their 
use at longer term being not optimized. Further- 
more, they generally carry out the prediction of 
a single value while the real problem sometimes 
requires predicting a vector of future values in 
one step. For example, in the case of some a pri- 
ori known periodicity, it could be interesting to 
predict all values for a period as a whole. But 
forecasting a vector requires either more complex 
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models (with potential loss of performance for 
some of the vector components) or many distinct 
single value predicting models (with potential loss 
of the correlation information between the various 
values). Methods able to forecast a whole vector 
with the same precision for each of its components 
are thus of great interest. 

While enlarging the prediction horizon is of 
course of primary interest for practitioners, there 
is of course some limit to the accuracy that can 
be expected for a long-term forecast. The limita- 
tion is due to the availability of the information 
itself, and not to possible limitations of the fore- 
casting methods. Indeed, there is no doubt that, 
whatever forecasting method is used, predicting 
at long term (i.e. many time steps in advance) is 
more difficult that predicting at short term, be- 
cause of the missing information in the unknown 
future time steps (those between the last known 
value and the one to predict). At some term, all 
prediction methods will thus fail. The purpose 
of the method presented in this paper is not to 
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enlarge the time horizon for which accurate pre- 
dictions could be expected, but rather to enlarge 
the horizon for which we can have insights about 
the future evolution of the series. By insights, we 
mean some information of interest to the prac- 
titioner, even if it does not mean accurate pre- 
dictions. For example, are there bounds on the 
future values ? What can we expect in average ? 
Are confidence intervals on future values large or 
narrow ? 

Predicting many steps in advance could be re- 
alized in a straightforward way, by subsampling 
the known sequence, then using any short-term 
prediction method. However, in this case, the 
loss of information (used for the forecast) is ob- 
viously even higher, due to the lower resolution 
of the known sequence. Furthermore, such so- 
lution does not allow in a general way to intro- 
duce a stochastic aspect to the method, which 
is a key issue in the proposed method. Indeed, 
to get insights about the future evolution of a se- 
ries through some statistics (expected mean, vari- 
ance, confidence intervals, quartiles, etc.), sev- 
eral predictions should be made in order to ex- 
tract such statistics. The predictions should dif- 
fer; a stochastic prediction method is able to gen- 
erate several forecasts by repeated Monte-Carlo 
runs. In the method presented in this paper, the 
stochastic character of the method results from 
the use of random draws on a probability law. 

Another attractive aspect of the method pre- 
sented in this paper is that it can be used to pre- 
dict scalar values or vectors, with the same ex- 
pected precision for each component in the case 
of vector prediction. Having at disposal a time se- 
ries of values x(t) with 1 < t < n, the prediction 
of a vector can be defined as follows : 

[a?(t+l), • • • , x(t+d)] = f(x(t), . . . , x(t-p+l))+e t 

(1) 

where d is the size of the vector to be predicted, 
/ is the data generating process, p is the num- 
ber of past values that influence the future values 
and et is a centred noise vector. The past val- 
ues are gathered in a p-dimensional vector called 
regress or. 

The knowledge of n values of the time series 



(with n » p and n » d) means that relation 
(PQ) is known for many (n — p — d+1) time steps in 
the past. The modeling problem then becomes to 
estimate a function / that models correctly the 
time series for the whole set of past regressors. 

The idea of the method is to segment the space 
of p-dimensional regressors. This segmentation 
can be seen as a way to make possible a local mod- 
eling in each segment. This part of the method 
is achieved using the Self-Organizing Map (SOM) 
PQ . The prototypes obtained for each class model 
locally the regressors of the corresponding class. 
Furthermore, in order to take into account tem- 
poral dependences in the series, deformation re- 
gressors are built. Those vectors are constructed 
as the differences between two consecutive regres- 
sors. The set of regressor deformations can also 
be segmented using the SOM. Once those two 
spaces are segmented and their dependences char- 
acterized, simulations can be performed. Using a 
kind of Monte- Carlo procedure to repeat the sim- 
ulations, it is then possible to estimate the distri- 
bution of these simulations and to forecast global 
trends of the time series at long term. 

Though we could have chosen some other clas- 
sical vector quantization (VQ) method as only the 
clustering property is of interest here, the choice 
of the SOM tool to perform the segmentation of 
the two spaces is justified by the fact that SOM 
are efficient and fast compared to other VQ meth- 
ods with a limited complexity [2] and that they 
provide an intuitive and helpful graphical repre- 
sentation. 

In the following of this paper, we first recall 
some basic concepts about the SOM classification 
tool. Then we introduce the proposed forecasting 
method, the double vector quantization, for scalar 
time series and then for vector ones. Next we 
present some experimental results for both scalar 
and vector forecastings. A proof of the method 
stability is given in appendix. 

2. The Kohonen Self-Organizing Maps 

The Self-Organizing Maps (SOM), developed 
by Teuvo Kohonen in the 80's [lj, has now be- 
come a well-known tool, with established prop- 
erties [3], [4]. Self- Organizing Maps have been 
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commonly used since their first description in a 
wide variety of problems, as classification, feature 
extraction, pattern recognition and other related 
applications. As shown in a few previous works 
0, 0, 0, 0, 0, da, the SOM may also be 
used to forecast time series at short term. 

The Kohonen Self- Organizing Maps (SOM) 
can be defined as an unsupervised classifica- 
tion algorithm from the artificial neural network 
paradigm. Any run of this algorithm results in a 
set, with a priori fixed size, of prototypes. Each 
one of those prototypes is a vector of the same di- 
mension as the input space. Furthermore, phys- 
ical neighbourhood relation links the prototypes. 
Due to this neighbourhood relation, we can eas- 
ily graphically represent the prototypes in a 1- or 
2-dimensional grid. 

After the learning stage each prototype repre- 
sents a subset of the initial input set in which 
the inputs share some similar features. Using 
Voronoi's terminology, the prototype corresponds 
to a centroid of a region or zone, each zone be- 
ing one of the classes obtained by the algorithm. 
The SOM thus realizes a vector quantization of 
the input space (a Voronoi tessellation) that re- 
spects the original distribution of the inputs. Fur- 
thermore, a second property of the SOM is that 
the resulting prototypes are ordered according to 
their location in the input space. Similar vec- 
tors in the input space are associated either to 
the same prototype (as in classical VQ) or to 
two prototypes that are neighbours on the grid. 
This last property, known as the topology preser- 
vation, does not hold for other standard vector 
quantization methods like competitive learning. 

The ordered prototypes of a SOM can easily be 
represented graphically, allowing a more intuitive 
interpretation: the 1- or 2-dimensional grid can 
be viewed as a 1- or 2-dimensional space where 
the inputs are projected by the SOM algorithm, 
even if, in fact, the inputs are rather projected on 
the prototypes themselves (with some interpola- 
tion if needed in the continuous case). This pro- 
jection operation for some specific input is pro- 
ceeded by determining the nearest prototype with 
respect to some distance metric (usually the Eu- 
clidian distance). 



3. The double quantization method 

The method described here aims to forecast 
long-term trends for a time series evolution. It 
is based on the SOM algorithm and can be di- 
vided into two stages: the characterization and 
the forecasting. The characterization stage can 
be viewed as the learning, while the forecasting 
can be viewed as the use of a model in a general- 
ization procedure. 

For the sake of simplicity, the method is first 
presented for scalar time series prediction (i.e. 
d = 1 in (pQ)) and then detailed later on for vector 
forecasting. Examples of the method application 
to scalar and vector time series will be provided 
in section [U 

3.1. Method description: characterization 

Though the determination of an optimal regres- 
sor in time series forecasting (at least in a non- 
linear prediction case) is an interesting and open 
question [11], it is considered here that the opti- 
mal, or at least an adequate, regressor of the time 
series is known. Classically, the regressor can 
for example be chosen according to some statisti- 
cal resampling (cross-validation, bootstrap, etc.) 
procedure. 

As for many other time series analysis methods, 
conversion of the inputs into regressors leads to 
n — p + 1 vectors in a p-dimensional space, where 
p is the regressor size and n the number of values 
at our disposal in the time series. The resulting 
regressors are denoted: 

x\_ p+1 = {x{t),x{t-l),...,x{t-p+l)}, (2) 

where p < t < n, and x(t) is the original time se- 
ries at our disposal with 1 < t < n. In the above 
x\_ p+l notation, the subscript index denotes the 
first temporal value of the vector, while the su- 
perscript index denotes its last temporal value. 

The obtained vectors x\_ p+l are then manip- 
ulated and the so-called deformations y\_ p+1 are 
created according to: 

Vt-p+l ~ X t-p+2 ~ x \-p+l' (3) 

Note that, by definition, each y\_ p+1 is associated 
to one of the x\_ p+1 . In order to highlight this 
link, the same indices have been used. 



4 



Putting all y\_ pJrl together in chronological or- 
der forms another time series of vectors, the defor- 
mations series in the so-called deformation space 
to be opposed to the original space containing the 
regressors x\ +1 . Of course, there exist n — p de- 
formations of dimension p. 

The SOM algorithm can then be applied to 
each one of these two spaces, quantizing both the 
original regressors x\_ pJrl and the deformations 
y\_ v+ i respectively. Note that in practice any 
kind of SOM map can be used, but it is assumed 
that one-dimensional maps (or strings) are more 
adequate in this context. 

As a result of the vector quantization by the 
SOM on all x\_ p+1 of the original space, n\ p- 
dimensional prototypes X{ are obtained (1 < i < 
ni). The clusters associated to Xi are denoted c,-. 
The second application of the SOM on all defor- 
mations y\_ p+1 in the deformation space results 
in ri2 p-dimensional prototypes y^ 1 < j < ri2. 
Similarly the associated clusters are denoted c'y 

To perform the forecasting, more information 
is needed than the two sets of prototypes. We 
therefore compute a matrix f(ij) based on the 
relations between the x\_ p+1 and the y\_ p+1 with 
respect to their clusters (q and c r - respectively). 
The row fij for a fixed i and 1 < j < rt2 is the 
conditional probability that y\_ p+1 belongs to c^-, 
given that x\_ p+1 belongs to q. In practice, those 
probabilities are estimated by the empirical fre- 
quencies: 

f _ g a and vt-p+i g c 'j} fA s 

with 1 < i < m, 1 < j < ri2. 

Note that, for a fixed z, elements (1 < j < 
712) sum to one; this justifies the fact that each 
row of the matrix is an (empirically estimated) 
probability law. Therefore the matrix will be 
called transition matrix in the following. 

The computation of this transition matrix com- 
pletes the characterization part of the method. 

3.2. Method description: forecasting 

Once the prototypes in the original and defor- 
mation spaces together with the transition matrix 
are known, we can forecast a time series evolution 
over a rather long-term horizon h (where horizon 



1 is defined as the next value t + 1 for time t) . 

The methodology for such forecasting can be 
described as follows. First, consider a time value 
x(t) for some time t. The corresponding regressor 
is x\_ p+1 . Therefore we can find the associated 
prototype in the original space, for example Xk 
(this operation is in fact equivalent to determin- 
ing the class Ck of x\_ p+1 in the SOM). We then 
look at row k in the transition matrix and ran- 
domly choose a deformation prototype y\ among 
the ijj according to the conditional probability 
distribution defined by fkj, 1 < j < ri2- The 
prediction for time t + 1 is obtained according to 
relation (|3]): 

%l-p+2 — x \-p+i + vu (5) 

where x^^ +2 ^ s ^ ne estimate of the true £^ +2 
given by our time series prediction model. How- 
ever £^p +2 is in fact a p-dimensional vector, with 
components corresponding to times from t — p + 2 
to t + 1 (see relations (|2j) and ([3])). As in the 
scalar case considered here we are only interested 
in a single estimate at time t + 1, we extract the 
scalar prediction x{t + 1) from the p-dimensional 
vector x\tl +2 . 

We can iterate the described procedure, plug- 
ging in x (t + 1) for x(t) in ([2j) to compute x^^ +3 
by ([5]) and extracting x (t + 2). We then do the 
same for x(t + 3), x(t + 4), . . . , x(t + h). This 
ends the run of the algorithm to obtain a single 
simulation of the series at horizon h. 

Next, as the goal of the method is not to per- 
form a single long-term simulation, the simula- 
tions are repeated to extract trends. Therefore 
a Monte-Carlo procedure is used to repeat many 
times the whole long-term simulation procedure 
at horizon /i, as detailed above. As part of the 
method (random choice of the deformation ac- 
cording to the conditional probability distribu- 
tions given by the rows of the transition matrix) 
is stochastic, repeating the procedure leads to dif- 
ferent simulations. Observing those evolutions al- 
lows estimating the simulation distribution and 
infer global trends of the time series, as the evolu- 
tion of its mean, its variance, confidence intervals, 
etc. 

It should be emphasized once again that the 
double quantization method is not designed to 
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determine a precise estimate for time t + 1 but 
is more specifically devoted to the problem of 
longterm evolution, which can only be obtained 
in terms of trends. 

3.3. Generalisation: vector forecasting 

Suppose that it is expected to predict vectors 
x\\i of future values of the times series x(t); x^f 
is a vector defined as: 



r t+d 



{a?(t + d),...,s(* + 2),a?(t + l)}, (6) 



where d is determined according to a priori knowl- 
edge about the series. For example when forecast- 
ing an electrical consumption, it could be advan- 
tageous to predict all hourly values for one day in 
a single step instead of predicting iteratively each 
value separately. 

As above regressors of this kind of time series 
can be constructed according to: 



1 — { x t-d+li x t-2d+l-> • • • ' X £-p+l}> (J) 



where p, for the sake of simplicity, is supposed to 
be a multiple of d though this is not compulsory. 
The regressor x\_ p+1 is thus constructed as the 
concatenation of d-dimensional vectors from the 
past of the time series, as it is the concatenation 
of single past values in the scalar case. As the 
x\_ p+1 regressor is composed of p/d vectors of 
dimension x\_ pJrl is a p-dimensional vector. 
Deformation can be formed here according to: 



?/* - r t+d 
Ut-p+i — x 



t-p+d+l 



L t-p+l' 



(8) 



Here again, the SOM algorithm can be applied 
on both spaces, classifying both the regressors 
x\_ p+l and the deformations y\_ p+1 respectively. 
We then have n\ prototypes xi in the original 
space, with 1 < i < m, associated to classes c*. 
In the deformation space, we have ri2 prototypes 
Vj, 1 < J ' < ^2, associated to classes c f -. 

A transition matrix can be constructed as a 
vector generalisation of relation (|4j): 



fi 



__ #{x\-p+i g c i and Vt-p+i 6 c 'j} 



*{x\- v+ i e Ci} 



(9) 



with 1 < i < rii, 1 < j < ri2. 

The simulation forecasting procedure can also 
be generalised: 



• consider the vector input x\_ d+1 for time t. 
The corresponding regressor is +1 ; 

• find the corresponding prototype x^\ 



• choose a deformation prototype yi among 
the ijj according to the conditional distri- 
bution given by elements fkj of row k; 



• forecast x t t ^_ p+d+1 as 



p,t+d 



(10) 



• extract the vector 



{x(t + l),x(t + 2),...,x(t + d)} 
from the d first columns of 
• repeat until horizon h. 

For this vector case too, a Monte-Carlo pro- 
cedure is used to repeat many times the whole 
longterm simulation procedure at horizon h. 
Then the simulation distribution and its statistics 
can be observed. This information gives trends 
for the long term of the time series. 

Note that using the SOM to quantize the vec- 



tors x 



t-p+i 



the method reaches the 



goal of forecasting vectors with the same precision 
for each of their components. Indeed each com- 
ponent from regressors x\_ p+l and y\_ p+1 has the 
same relative weight while the distance between 
the considered regressor and prototype is com- 
puted in the SOM algorithm. None of the x\_ p+1 
or y\_ pJrl components have thus a greater impor- 
tance in the modification of the prototype weight 
during the learning of the SOM. 

3.4. Extensions 

Two important comments must be done. 

First, as illustrated in both examples below, 
it is not mandatory (in equations (pQ), ([2]), (|6]), 
(0) to consider all successive values in the re- 
gressor; according to the knowledge of the series 
or to some validation procedure, it might be in- 
teresting to select regressors with adequate, but 
not necessarily successive, scalar values or vectors 
in the past. 

Secondly, the vector case has been illustrated 
in the previous section on temporal vectors (see 
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equation (|5J)). An immediate extension of the 
method would be to consider spatial vectors, for 
example when several series must be predicted si- 
multaneously. The equations in the previous sec- 
tion should be modified, but the principle of the 
method remains valid. 

3.5. Method stability 

The predictions obtained by the model de- 
scribed in the previous subsections should ide- 
ally be confined in the initial space defined by 
the learning data set. In that case, the series of 
predicted values y\_ p+1 is said to be stable. Oth- 
erwise, if the series tends to infinity or otherwise 
diverges, it is said to be unstable. The method 
has been proven to be stable according to this 
definition; a proof is given in appendix. 

4. Experimental results 

This section is devoted to the application of the 
method on two times series. The first one is the 
well-known Santa Fe A benchmark presented in 
P2]; it is a scalar time series. The second time 
series is the Polish electrical consumption from 
1989 to 1996 [6J. This real- world problem requires 
the prediction of a vector of 24 hourly values. 

4.1. Methodology 

In the method description, the numbers n\ and 
ri2 of prototypes have not been fixed. Indeed, 
the problem is that different values of n\ (77,2) 
result in different segmentations in the original 
(deformation) space and in different conditional 
distribution in the transition matrix. The model 
may thus slightly vary. 

Selecting the best values for m and ri2 is an 
important question too. Traditionally, such hy- 
perparameters are estimated by model selection 
procedures such as AIC, BIC or computationally- 
costly resampling techniques (Leave- One- Out, k- 
fold cross validation, bootstrap). As it will be 
shown further in this paper, exact values of n\ 
and U2 are not necessary, as the sensitivity of the 
method around the optimums is low. A simple 
validation is then used to choose adequate val- 
ues for ri\ and ri2. For that purpose the available 
data are divided into three subsets: the learning, 
the validation and the test set. The learning set 



is used to fix the values of the model parameters, 
such as the weights of the prototypes in the SOM 
and the transition matrix. The validation set is 
used to fix meta-parameters, such as the numbers 
ni and ri2 of prototypes in the SOM maps. The 
validation set is thus used for model selection. 
The test set aims to see how the model behaves 
on unused data that mimic real conditions. 

The selection of n\ and ri2 is done with regards 
to an error criterion, in our case a sum of squared 
error criterion, computed over the validation set 
VS: 

e SSE = Yl (y(t + l)-y(t+l)) 2 . (11) 

y(t+i)evs 

Once rt\ and ri2 have been chosen, a new learn- 
ing is done with a new learning set obtained 
from the reassembled learning and validation sets. 
This new learning is only performed once with op- 
timal values for ri\ and ri2. 

Note that, hopefully, the sensitivity of the 
method to specific values of n\ and ri2 is not high. 
This has been experimentally verified in all our 
simulations, and will be illustrated on the first 
example (Santa Fe A) in section l4~2l 

Another crucial question is the sensitivity of 
the method to various runs of the SOM algo- 
rithm (with the same ri\ and ri2 values). Indeed 
it is well known that initial conditions largely in- 
fluence the exact final result of the SOM algo- 
rithm (by final result it is meant the prototype 
locations, and their neighborhood relations) [T3j . 
Nevertheless, as mentioned above, the neighbor- 
hood relations of the SOM are used for visual- 
ization purposes only; they do not influence the 
results of the forecast. Moreover, the location 
of the centroids are used to quantize the space 
(therefore allowing the estimation of the empiri- 
cal conditional frequencies of the clusters); small 
variations in the centroid location have thus a 
low influence on each prediction generated by the 
method, and an even lower one on the statistics 
(mean, confidence intervals, etc.) estimated from 
the predictions. This last result has been con- 
firmed experimentally in all our simulations, for 
which no significant difference was observed after 
different runs of the two SOM algorithms. 
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4.2. Scalar forecasting: Santa Fe A 

The Santa Fe A time series [12] has been ob- 
tained from a far-infrared-laser in a chaotic state. 
This time series has become a well-known bench- 
mark in time series prediction since the Santa Fe 
competition in 1991. The completed data set con- 
tains 10 000 data. This set has been divided here 
as follows: the learning set contains 6000 data, 
the validation set 2000 data, and test set 100 
data. Note that the best neural network mod- 
els described in [12] do not predict much more 
than 40 data, making a 100-data test set a very 
long-term forecasting. 

Here, the regressors x\_ p+1 have been con- 
structed according to 

x\_ v+1 = {*(£), z(£-l),z(£- 2), 

x(t - 3), x(t - b), x(t - 6)112) 

This choice is made according to previous expe- 
rience on this series [12]. In other words, d = 1, 
p = 6 (as value x(t — 4) is omitted) and h = 100. 

In this simulation, Kohonen strings of 1 up to 
200 prototypes in each space have been used. All 
the 40 000 possible models have been tested on 
the validation set. The best model among them 
has 179 prototypes in the regressor space and 161 
prototypes in the deformation space. After re- 
learning this model on both the learning and val- 
idation sets, 1000 simulations were performed on 
a horizon of 100. Then the mean and confidence 
interval at 95% level were computed, giving infor- 
mation on the time series trends. Figure [T] shows 
the mean of the 1000 simulations compared to 
the true values contained in the test set, together 
with the confidence interval at 95% level. Figure 
[2] shows a zoom on the first 30 values. In fig- 
ure [3l we can see 100 simulations for the same 30 
values. Note the stability obtained through the 
replications. For a simpler model with ri\ = 6 
and ri2 = 8 (used for illustrations purposes), fig- 
ure [H shows the code vectors and regressors (resp. 
deformations) in each class; table [T] shows the cor- 
responding transition matrix. 

From figure [21 it should be noted that the 
method gives roughly the first 25 values of the 
time series, a result that is not so far from those 
obtained with the best neural network models of 
the Santa Fe competition [T2] . 



— True time series 

Simulation mean 

Confidence interval at 95 % level 




Time 



Figure 1. Comparison between the mean of 
the 1000 simulations (solid) and the true values 
(dashed), together with confidence intervals at 
95% level (dotted). 



- - True time series 

Simulation mean 

Confidence interval at 95 % level 




Time 



Figure 2. Comparison for the first 30 values be- 
tween the mean of the 1000 simulations (solid) 
and the true values of the test set (dashed), to- 
gether with confidence intervals at 95% level (dot- 
ted). 



From figure [H we can infer that the series mean 
will neither increase nor decrease. In addition, 
the confidence interval does contain the whole 
evolution of the time series for the considered 100 
future values. The trend for long term forecasting 
is thus that the series, though chaotic, will show 
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some kind of stability in its evolution for the next 
100 values. 

As all the 40 000 models have been generated 
and learned, the influence of varying the n\ and 
ri2 values can be observed. This influence is il- 
lustrated in figure El It is clear from this figure 
that there is a large flat region around the op- 
timal values; in this region, all models general- 
ize rather equivalent ly. This justifies, a posteri- 
ori, the choice of a simple resampling method to 
choose 774 and ri2. 





Figure 4. The code vectors and associated curves 
in the regressor (top) and deformation (bottom) 
spaces (when n\ = 6 and ri2 = 8). The code 
vectors are represented in white as 6-dimensional 
vectors (according to ([12]) ). Regressors (resp. de- 
formations) belonging to each class are shown in 
black. 



Figure 3. 100 simulations picked out at random 
from the 1000 simulations made for the Santa Fe 
A long-term forecasting. 
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Table 1 

Example of transition matrix, here with rt\ — 6 
and n2 = 8 as in figured! Note that in each row, 
the frequency values sum to one. 



Figure 5. Impact of the variation of ri\ and rt2 on 
the model generalization ability for the Santa Fe 
A time series. 



4.3. Vector forecasting: the Polish electri- 
cal consumption 

As second example, we use the Polish electrical 
load time series [6]. This series contains hourly 
values from 1989 to 1996. The whole dataset con- 
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tains about 72 000 hourly data and is plotted in 
figure [6j Due to the daily periodicity of the time 
series, we are interested in daily predictions. This 
is thus an illustration of the case d > 1, since it 
seems natural to forecast the 24 next values in one 
step (the next day), the time window becoming 
daily instead of hourly. 




Figure 6. The Polish electrical consumption time 
series, between 1989 and 1996. 



Having now at our disposal 3000 x\_ pJrl data 
of dimension 24, we use 2000 of them for the 
learning, 800 for a simple validation and 200 for 
the test. Since the optimal regressor is unknown, 
many different regressor s were tried, using intu- 
itive understanding of the process. The final re- 
gressor is: 

T t _ f^t t-24 t-48 

x t-p+l — l x t-24+l5 ^£_48+l? ^£-72+1? 

T t-144 t-168 i fio\ 

that is the 24 hourly values of today, of yester- 
day, of two, six and seven days ago. This regres- 
sor is maybe not the optimal one, but it is the 
one that makes the lowest error on the validation 
set in comparison with other tested ones. Since 
the regressor contains p = 5 data of dimension 
d = 24, we work in a 120-dimensional space. We 
then run the algorithm again on the learning set 
with values for m and ri2 each varying from 5 to 
200 prototypes by steps of 5. The lowest error is 



made by a model with m = 160 and ri2 = 140 
respectively. 

Another model is then learned with 160 and 
140 parameter vectors in each space with the new 
learning set, now containing 2000+800 data. The 
forecasting obtained from this model is repeated 
1000 times. Figure [71 presents the mean of the 
1000 simulations obtained with 24-dimensional 
vectors and with horizon ft limited to 40 days (a 
single plot of the whole 24 * 200 predicted val- 
ues becomes unreadable). For convenience, fig- 
ure [8] shows a zoom and a comparison between 
the mean of those 1000 long-term predictions and 
the real values. A confidence interval at 95% level 
is also provided. 



Predicted mean of the time series with 1000 simulations 




Time 



Figure 7. Mean of the 1000 simulations at long 
term (ft 40). 



From figure [51 it is clear that the mean of the 
prediction at long term will show the same peri- 
odicity as the true time series and that the values 
will be contained in a rather narrow confidence 
interval. This fact denotes a probable low varia- 
tion of the series at long term. 

Figure [9] shows 100 predictions obtained by the 
Monte- Carlo procedure picked up at random be- 
fore taking the mean. See that different simula- 
tions have about the same shape; this is a main 
argument for determining long-term trends. 

Finally, as in the previous example, the influ- 
ence of TT-i and ri2 can be observed. In figure [10J 



10 




Figure 8. Comparison between the true values 
(dashed), the mean of the predictions (solid) and 
the confidence interval at 95 % level (dotted). 



Figure 10. Impact of the variation of n\ and ri2 
on the model generalization ability for the Polish 
electrical consumption problem. 




Time 



Figure 9. Plot of 100 simulations chosen at ran- 
dom from the 1000 simulations. 



a very large flat region is also present around the 
best model. Sub-optimal selection of the n\ and 
77,2 values will thus not penalize too heavily the 
model generalization abilities. 

5. Conclusion 

In this paper, we have presented a time series 
forecasting method based on a double classifica- 
tion of the regressors and of their deformations 
using the SOM algorithm. The use of SOMs 



makes it possible to apply the method both on 
scalar and vector time series, as discussed in sec- 
tion [3] and illustrated in section [H A proof of the 
method stability is given in appendix. 

The proposed method is not designed to obtain 
an accurate forecast of the next values of a series, 
but rather aims to determine long-term trends. 
Indeed, its stochastic nature allows repeating sim- 
ulations by a Monte- Carlo procedure, allowing to 
compute statistics (variance, confidence intervals, 
etc.) on the predictions. Such a method could 
also be used for example in the financial context, 
for the estimation of volatilities. 
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Appendix 
Method stability 

Intuitively, the stability property of the method 
is not surprising. Indeed, the model is designed 
such that it will mostly produce predictions that 
are in the range of the observed data. By con- 
struction, deformations are chosen randomly ac- 
cording to an empirical probability law and the 
obtained predictions should stay in the same 
range. If, for some reason, the prediction is about 
to exceed this range during one of the simulations, 
the next deformations will then tend to drive it 
back inside this range, at least with high proba- 
bility. Furthermore, as simulations are repeated 
with the Monte-Carlo procedure, the influence of 
such unexpected cases will be reduced when the 
mean is taken to obtain the final predictions. The 
following of this section is intended to prove this 
intuitive result. 

The proof consists in two steps: it is first 
shown that the series generated by the model 
is a Markov chain; secondly, it is demonstrated 
that this particular type of Markov chain is sta- 
ble. In order to improve the readability of the 
proof, lighter notations will be used. For a fixed 
d and a fixed p, notation X t will represent the 
vector x\_ p+1 . The last known regressor will be 
denoted Xo. The prototype of a cluster of de- 
formations will be noted Yj . Finally, hats will 
be omitted for simplicity as all regressors X t are 
estimations, except for t = 0. 
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To prove that the series is a Markov chain, we 
consider the starting vector of the simulation at 
time 0. The corresponding initial regressor of the 
series is denoted Xq, and Co is the corresponding 
SOM cluster in the regressor space. The defor- 
mation that is applied to X at this stage is Yq. 
Then the next values of the series are given by 
X 1 = X + y , X 2 = X + Y + Y u . . . , with 
Yq , Y\ , ... drawn randomly from the transition 
matrix for clusters Co , C\ , ... respectively. The 
series Xt is therefore a Markov chain, homoge- 
neous in time (the transition distribution are not 
time dependant), irreducible and defined over a 
numerable set (the initial X t are in finite num- 
ber, and so are the deformations). 

To show the stability of this Markov chain and 
thus the existence of a stationary distribution, 
Foster's criterion [14] is applied. Note that this 
criterion is a stronger result which proves the er- 
godicity of the chain, which in turns implies the 
stability. Foster's criterion is the following: 

A necessary and sufficient condition for an ir- 
reducible chain to be ergodic is that there exists 
a positive function g(.), a positive e and a finite 
set A such that: 



Mx £ Q : E(g(X t +i)\X t = x) < oo, 
£ fi : E(g(X t+1 )\X t = x) - g(x) < 



-e. 
(14) 



Since the Markov chain is homogenous, it is 
sufficient to observe transition Yq from Xq to X\ . 
The same development can be deduced for any 
other transition. 

The demonstration is done for two-dimensional 
regressors but can be generalized easily to other 
dimensions. Note that in the following, we use 

Hi-fin©- 

Before going in further details, let us remark 
that for a SOM with at least 3 classes in general 
position, class Co covers less than a half plane. 
Furthermore, we have to distinguish two cases for 
each cluster. First, the cluster may be included in 
a finite compact from IR 2 . The second case is the 
case of an infinite cluster i.e. of a cluster which 
may does have any neighbour in some direction; 
this happens to ckusters on the border of the map. 

The first case is easely proved. Since ||Xo|| < 





(a) A Cluster within an (b) A Cluster within an ob- 
acute angle tuse angle 



Figure 11. Notations for the cone containing an 
unbounded cluster of a SOM; see text for details. 



Ro, where Ro can be any constant, then we have 
by triangular inequality: 



E(\\Xi\ 



< 
< 



Ro 
Ro 



11*0 

max 



(15) 



As the deformations Yj are in finite number, the 
maximum of their norm is finite. This proves the 
first inequality of (jHj) in an obvious way for the 
first case (i.e. bounded cluster case). 

The other case thus appens when ||Xo|| — > +oo. 
This happens in unbounded clusters. The un- 
bounded cluster case is much more technical to 
prove. 

Looking at figure [TTJ we see that each un- 
bounded cluster is included in a cone with ver- 
tex A and delimited by the normalized vectors 
ai and <i2- There are two possibilities: either a\ 
and d2 form an acute angle, either an obtuse one, 
as shown in figure ll(a)| and figure ll(b)| respec- 
tively. 

Before going on and applying Foster's criterion, 
note that the three following geometrical proper- 
ties can be proven: 



Property 1. 

Denoting 

lim 



INI 



• ai 



(16) 



we have 5\ and 62 both positive in the acute angle 
case, while either Si or 82 is positive for an obtuse 
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angle. Indeed, using the origin O, we define: 

Ox = OA + Ax. (17) 
We thus have: 

x OA-di Ax \\~Ax\\ 



\Ax\\ \\ x \\ 



CLi (18) 



which can be bounded by a strictly positive con- 
stant as and &- 1 for Ibll -> -foe. 
INI INI 11 11 

Property 2. 

We define 6i such that the angle (ai, 6i) is -ff. 
Similarly 62 is defined such that the angle (62, (22) 



is also 



r. Then, for both the acute and obtuse 



angle cases, we have: 



inf • bi 

xec \\x\\ 



Ti > 0, 



(19) 



where C is the considered cone which has border 
vectors a\ and a^. 

Rewrite the first term of (fT9l) as: 



. £ Ax 
mf --— 

xec \\x\\ 



bj — inf 



Ax \\Ax\\ 



^ c \\Ax\\ \\A\ 



(20) 



W - 1 when 



the result is obtained easily since 
H -+oo. 

Property 3. 

Assume that: 



#mo(*o) ■ ai < and E Mo (Y ) • a 2 < (21) 

where /io is the empirical distribution correspond- 
ing to class Cq in the transition matrix. Denoting 



E^ (Y ) •a i = -j i <0 
with ji > 0, then we have: 

E^M-bi <0 



(22) 



(23) 



for either i = 1 or z = 2 in case of an acute angle 
(figure [12(a) ) or for both of i = 1 and z = 2 for 



the obtuse case (figure [12 (b) ). 



Note that the initial assumption can easily be 
proved numerically. 

Those three properties will be used as lemmas 
in the following. Now we can apply Foster's cri- 
terion for the unbounded cluster case. 




(a) Acute angle case. (b) Obtuse angle case. 



Figure 12. Third geometrical property, see text 
for details. 



Foster's criterion 

Considering an unbounded class Co and 
the corresponding transition distribution, with 
g(x) = ||x|| 2 , we have 



E{g{X 1 )\Xo = x)-g{x) 

= E(g(Xo + Yo\Xo = x)-g(x) 
= E(\\Xo + Yo\\ 2 \X = x)-\\x 

x-E^jYp) iWIIYoll 2 ) ' 



211x11 



211*11 



(24) 



The second term between the brackets can be 
bounded by a strictly positive constant ao- In- 
deed, as ||Fo|| 2 is finite, ^ (ll^o|| 2 ) < M is also 
finite. Therefore, for ao > and ||x|| > we 
have 



1 



^^o(ll^)|| 2 )<«0. 



(25) 



For the first term, we chose either i = 1 or i = 2 
such that: 



x 

lim ~r. 77 ■ ai = Si > 0, 
IMI-+00 ||x|| 

E w (Yo) ■ h < 0. 



(26) 



In case of an unbounded cluster, those two con- 
ditions are fulfilled using Properties 1. and 3. 

By hypothesis, suppose that i = 2 satisfies 
those two conditions (|26|) . The term E^ ( Yq ) can 
be decomposed in the (62,^2) basis. Then, for 
1 1 #|| sufficiently large, as: 

• E flQ (Yo) ■ a 2 = -72 by Property 3.; 
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• jj^jj- • a 2 > y by Property 1.; 

• E^Yq) • b 2 < by Property 3.; 



n— [7 • h > — as Ox 
\x\\ 2 



OA + Ax and by 



Property 2.; 
we have 



< (^o(^b) -a 2 ) { tttt -a 2 



(E, (Y )-b 2 )[-r--b 2 



<o 



< -72y, 

when ||x|| is large enough, denoted here ||x|| > Lq. 

The same development can be achieved using 
i = 1 to satisfy the two initial conditions ([26]) . 
We obtain: 



M ^ o( y )<- 7l - 

when ||x|| > L' . 

Equation (j24j) can now be simplified in: 



(27) 



Finally, we obtain, using ([29]) in 

E(g(X 1 )\X =x)- g(x) < -a\\x\\, (30) 
where the right member tends to — oo for \\x\\ — > 

To conclude, we define the set Vt used in Fos- 
ter's criterion according to 

n=(\JC^\J{X \\\X \\<K}, (31) 

where / denotes the set of bounded cluster in- 
dexes as discussed in the introduction to the 
proof. With this definition, the above develop- 
ments prove Foster's criterion ([T4|) . Thus the 
Markov chain defined by the Xi for i > is er- 
godic, and admits a unique stationary distribu- 
tion. 



E(g(X 1 )\X = x)-g(x) 



= 2\\x\\ 

<2\\x\\ 
= -2\\x 



INI " h 

-a + \ao\ 

I «o 
I 2 ' 



llVoll 2 ) 



2||s|| 



(28) 



where ||x|| > Ko = max(Lo,Lo) and ao in ([25|) is 
chosen such that ao = min ^ :} ^r I , :r rr 1 ^. 

This development has been done for cluster Co . 
All values c^o, Md, ^o, Ko depends on this cluster 
Co- Now considering all unbounded clusters C{ 
and taking a = inf^. and if = sup c . we 
have: 



V||s|| >if : 

^ MO (F ) , ^odl^oll 2 ) 



INI 



2H 



< 



"2 <0 - 



(29) 



